Abstract

A concurrent object is a data structure shared by concurrent processes. 
Conventional techniques for implementing concurrent objects typically rely 
on critical sections: ensuring that only one process at a time can operate on 
the object. Nevertheless, critical sections are poorly suited for asynchronous 
systems: if one process is halted or delayed in a critical section, other, non- 
faulty processes will be unable to progress. By contrast, a concurrent object 
implementation is non-blocking if it always guarantees that some process will 
complete an operation in a finite number of steps, and it is wait-free if it 
guarantees that each process will complete an operation in a finite number 
of steps. This paper proposes a new methodology for constructing non- 
blocking and wait-free implementations of concurrent objects. The object's 
representation and operations are written as stylized sequential programs, 
with no explicit synchronization. Each sequential operation is automat- 
ically transformed into a non-blocking or wait-free operation using novel 
synchronization and memory management algorithms. These algorithms 
are presented for a multiple instruction/multiple data (MIMD) architecture 
in which n processes communicate by applying read, write, load-linked, and
store-conditional operations to a shared memory. 

©Digital Equipment Corporation 1991. All rights reserved. 
1 Introduction

A concurrent object is a data structure shared by concurrent processes. Con- 
ventional techniques for implementing concurrent objects typically rely on 
critical sections to ensure that only one process at a time is allowed access to 
the object. Nevertheless, critical sections are poorly suited for asynchronous 
systems; if one process is halted or delayed in a critical section, other, faster 
processes will be unable to progress. Possible sources of unexpected delay 
include page faults, cache misses, scheduling preemption, and perhaps even 
processor failure. 

By contrast, a concurrent object implementation is non-blocking if some 
process must complete an operation after the system as a whole takes a 
finite number of steps, and it is wait-free if each process must complete an 
operation after taking a finite number of steps. The non-blocking condition 
guarantees that some process will always make progress despite arbitrary 
halting failures or delays by other processes, while the wait-free condition 
guarantees that all non-halted processes make progress. Either condition 
rules out the use of critical sections, since a process that halts in a critical 
section can force other processes trying to enter that critical section to run 
forever without making progress. The non-blocking condition is appropriate 
for systems where starvation is unlikely, while the (strictly stronger) wait- 
free condition may be appropriate when some processes are inherently slower 
than others, as in certain heterogeneous architectures. 

The theoretical issues surrounding non-blocking synchronization proto- 
cols have received a fair amount of attention, but the practical issues have 
not. In this paper, we make a first step toward addressing these practical 
aspects by proposing a new methodology for constructing non-blocking and 
wait-free implementations of concurrent objects. Our approach focuses on 
two distinct issues: ease of reasoning, and performance. 

• It is no secret that reasoning about concurrent programs is difficult. 
A practical methodology should permit a programmer to design, say, 
a correct non-blocking priority queue, without ending up with a pub- 
lishable result. 

• The non-blocking and wait-free properties, like most kinds of fault- 
tolerance, incur a cost, especially in the absence of failures or delays. 
A methodology can be considered practical only if (1) we understand 
the inherent costs of the resulting programs, (2) this cost can be kept to 
acceptable levels, and (3) the programmer has some ability to influence
these costs. 

We address the reasoning issue by having programmers implement data 
objects as stylized sequential programs, with no explicit synchronization. 
Each sequential implementation is automatically transformed into a non- 
blocking or wait-free implementation via a collection of novel synchroniza- 
tion and memory management techniques introduced in this paper. If the 
sequential implementation is a correct sequential program, and if it follows 
certain simple conventions described below, then the transformed program 
will be a correct concurrent implementation. The advantage of starting with 
sequential programs is clear: the formidable problem of reasoning about 
concurrent programs and data structures is reduced to the more familiar 
sequential domain. (Because programmers are required to follow certain 
conventions, this methodology is not intended to parallelize arbitrary se- 
quential programs after the fact.) 

To address the performance issue, we built and tested prototype im- 
plementations of several concurrent objects on a multiprocessor. We show 
that a naive implementation of our methodology performs poorly because 
of excessive memory contention, but simple techniques from the literature 
(such as exponential backoff) have a dramatic effect on performance. We 
also compare our implementations with more conventional implementations 
based on spin locks. Even in the absence of timing anomalies, our example 
implementations sometimes outperform conventional spin-lock techniques, 
and lie within a factor of two of more sophisticated spin-lock techniques. 

We focus on a multiple instruction/multiple data (MIMD) architecture 
in which n asynchronous processes communicate by applying read, write, 
load-linked, and store-conditional operations to a shared memory. The
load-linked operation copies the value of a shared variable to a local variable.
A subsequent store-conditional to the shared variable will change its value 
only if no other process has modified that variable in the interim. Either 
way, the store-conditional returns an indication of success or failure. (Note 
that a store-conditional is permitted to fail even if the variable has not
changed. We assume that such spurious failures are rare, though possible.) 
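As a rough illustration of these semantics, the following sketch models a single shared word in plain sequential C. It is not thread-safe and is not how hardware or our prototype realizes the primitives; the types llsc_word and llsc_link, the write counter, and the function llsc_demo are assumptions introduced only for this example. Spurious failures are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one shared word: its value plus a write counter. */
typedef struct {
    int value;
    unsigned long writes;      /* incremented by every store that takes effect */
} llsc_word;

/* What one process remembers between load_linked and store_conditional. */
typedef struct {
    const llsc_word *addr;     /* word the reservation refers to, or NULL */
    unsigned long writes_seen; /* write counter observed by load_linked */
} llsc_link;

/* Copy the shared value to the caller and record a reservation. */
static int load_linked(llsc_link *link, const llsc_word *w) {
    link->addr = w;
    link->writes_seen = w->writes;
    return w->value;
}

/* Store v only if nobody has written w since the matching load_linked.
   Returns true on success and false on failure; the reservation is
   consumed either way. */
static bool store_conditional(llsc_link *link, llsc_word *w, int v) {
    bool ok = (link->addr == w) && (link->writes_seen == w->writes);
    link->addr = NULL;
    if (ok) {
        w->value = v;
        w->writes++;
    }
    return ok;
}

/* Two interleaved "processes" P and Q read the same word; only the
   first store_conditional can succeed. Returns the final value. */
static int llsc_demo(void) {
    llsc_word w = { 5, 0 };
    llsc_link p, q;
    int vp = load_linked(&p, &w);
    int vq = load_linked(&q, &w);
    bool q_ok = store_conditional(&q, &w, vq + 1);   /* Q wins */
    bool p_ok = store_conditional(&p, &w, vp + 10);  /* P lost the race */
    assert(q_ok && !p_ok);
    return w.value;
}
```

In this model P's store fails because Q's successful store changed the write counter after P's load-linked, which is the behavior the paragraph above describes.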

We chose to focus on the load-linked and store-conditional synchroniza- 
tion primitives for three reasons. First, they can be implemented efficiently 
in cache-coherent architectures [9, 25], since store-conditional need only
check whether the cached copy of the shared variable has been invalidated. 
Second, many other "classical" synchronization primitives are provably
inadequate: we have shown elsewhere [22] that it is impossible (footnote 1)
to construct non-blocking or wait-free implementations of many simple and
useful data types using any combination of read, write, test&set, fetch&add
[18], and memory-to-register swap. The load-linked and store-conditional
operations, however, are universal: at least in principle, they are powerful
enough to transform any sequential object implementation into a non-blocking
or wait-free implementation. Finally, we have found load-linked and
store-conditional easy to use. Elsewhere [23], we present a collection of
synchronization and memory management algorithms based on compare&swap
[24]. Although these algorithms have the same functionality as those given
here, they are less efficient, and conceptually more complex.

In our prototype implementations, we used the C language [27] on an 
Encore Multimax [11] with eighteen NS32532 processors. This architecture 
does not provide load-linked or store-conditional primitives, so we simulated 
them using short critical sections. Naturally, our simulation is less efficient 
than direct hardware support. For example, a successful store-conditional 
requires twelve machine instructions rather than one. Nevertheless, these 
prototype implementations are instructive because they allow us to compare
the relative efficiency of different implementations using load-linked
and store-conditional, and because they still permit an approximate
comparison of the relative efficiency of waiting versus non-waiting techniques.
We assume readers have some knowledge of the syntax and semantics of C. 

In Section 2, we give a brief survey of related work. Section 3 describes 
our model. In Section 4, we present protocols for transforming sequential 
implementations of small objects into non-blocking and wait-free implemen- 
tations, together with experimental results showing that our techniques can 
be made to perform well even when each process has a dedicated proces- 
sor. In Section 5, we extend this methodology to encompass large objects. 
Section 6 summarizes our results, and concludes with a discussion. 

2 Related Work 

Early work on non-blocking protocols focused on impossibility results [8, 
12, 13, 14, 16, 22], showing that certain problems cannot be solved in asyn- 
chronous systems using certain primitives. By contrast, a synchronization 
primitive is universal if it can be used to transform any sequential object
implementation into a wait-free concurrent implementation. The author [22]
gives a necessary and sufficient condition for universality: a synchronization
primitive is universal in an n-process system if and only if it solves the
well-known consensus problem [16] for n processes. Although this result
established that wait-free (and non-blocking) implementations are possible in
principle, the construction given was too inefficient to be practical. Plotkin
[40] gives a detailed universal construction for a sticky-bit primitive. This
construction is also of theoretical rather than practical interest. Elsewhere
[23], the author gives a simple and relatively efficient technique for
transforming stylized sequential object implementations into non-blocking and
wait-free implementations using the compare&swap synchronization primitive.
Although the overall approach is similar to the one presented here, the
details are quite different. In particular, the constructions presented in this
paper are simpler and more efficient, for reasons discussed below.

1 Although our impossibility results were presented in terms of wait-free
implementations, they hold for non-blocking implementations as well.

Many researchers have studied the problem of constructing wait-free 
atomic registers from simpler primitives [6, 7, 28, 31, 36, 38, 39, 43]. Atomic
registers, however, have few if any interesting applications for concurrent 
data structures, since they cannot be combined to construct non-blocking 
or wait-free implementations of most common data types [22]. There exists 
an extensive literature on concurrent data structures constructed from more 
powerful primitives. Gottlieb et al. [19] give a highly concurrent queue imple- 
mentation based on the replace-add operation, a variant of fetch&add. This 
implementation permits concurrent enqueuing and dequeuing processes, but 
it is blocking, since it uses critical sections to synchronize access to individual 
queue elements. Lamport [30] gives a wait-free queue implementation that 
permits one enqueuing process to execute concurrently with one dequeuing 
process. Herlihy and Wing [21] give a non-blocking queue implementation, 
employing fetch&add and swap, that permits an arbitrary number of en- 
queuing and dequeuing processes. Lanin and Shasha [32] give a non-blocking 
set implementation that uses operations similar to compare&swap. There
exists an extensive literature on locking algorithms for concurrent B-trees 
[4, 33, 42] and for related search structures [5, 15, 17, 20, 26]. Anderson 
and Woll [1] give efficient wait-free solutions to the union-find problem in a 
shared-memory architecture. 

The load-linked and store-conditional synchronization primitives were
first proposed as part of the S-1 project [25] at Lawrence Livermore
Laboratories, and they are currently supported in the MIPS-II architecture [9].
They are closely related to the compare&swap operation first introduced by 
the IBM 370 architecture [24]. 
3 Overview

A concurrent system consists of a collection of n sequential processes that 
communicate through shared typed objects. Processes are sequential — each 
process applies a sequence of operations to objects, alternately issuing an 
invocation and then receiving the associated response. We make no fairness 
assumptions about processes. A process can halt, or display arbitrary vari- 
ations in speed. In particular, one process cannot tell whether another has 
halted or is just running very slowly. 

Objects are data structures in memory. Each object has a type, which 
defines a set of possible values and a set of primitive operations that provide 
the only means to manipulate that object. Each object has a sequential 
specification that defines how the object behaves when its operations are 
invoked one at a time by a single process. For example, the behavior of a 
queue object can be specified by requiring that enqueue insert an item in 
the queue, and that dequeue remove the oldest item present in the queue. 
In a concurrent system, however, an object's operations can be invoked by 
concurrent processes, and it is necessary to give a meaning to interleaved 
operation executions. 

An object is linearizable [21] if each operation appears to take effect 
instantaneously at some point between the operation's invocation and re- 
sponse. Linearizability implies that processes appear to be interleaved at the 
granularity of complete operations, and that the order of non-overlapping 
operations is preserved. As discussed in more detail elsewhere [21], the no- 
tion of linearizability generalizes and unifies a number of ad-hoc correctness 
conditions in the literature, and it is related to (but not identical with) 
correctness criteria such as sequential consistency [29] and strict serializ- 
ability [37]. We use linearizability as the basic correctness condition for the 
concurrent objects constructed in this paper. 

Our methodology is the following. 

1. The programmer provides a sequential implementation of the object, 
choosing a representation and implementing the operations. This pro- 
gram is written in a conventional sequential language, subject to cer- 
tain restrictions given below. This implementation performs no ex- 
plicit synchronization. 

2. Using the synchronization and memory management algorithms de- 
scribed in this paper, this sequential implementation is transformed 
into a non-blocking (or wait-free) concurrent implementation. Although
we do not address the issue here, this transformation is simple
enough to be performed by a compiler or preprocessor.

We refer to data structures and operations implemented by the program- 
mer as sequential objects and sequential operations, and we refer to trans- 
formed data structures and operations as concurrent objects and concurrent 
operations. By convention, names of sequential data types and operations 
are in lower-case, while names of concurrent types and operations are capi- 
talized. (Compile-time constants typically appear in upper-case.) 

4 Small Objects 

A small object is one that is small enough to be copied efficiently. In this 
section we discuss how to construct non-blocking and wait-free implemen- 
tations of small objects. In a later section, we present a slightly different 
methodology for large objects, which are too large to be copied all at once. 

A sequential object is a data structure that occupies a fixed-size contigu- 
ous region of memory called a block. Each sequential operation is a stylized 
sequential program subject to the following simple constraints: 

• A sequential operation may not have any side-effects other than
modifying the block occupied by the object.

• A sequential operation must be total, meaning that it is well-defined 
for every legal state of the object. (For example, the dequeue operation 
may return an error code or signal an exception when applied to an 
empty queue, but it may not provoke a core dump.) 

The motivation for these restrictions will become clear when we discuss how 
sequential operations are transformed into concurrent operations. 

Throughout this paper, we use the following extended example. A pri- 
ority queue (pqueue_type) is a set of items taken from a totally-ordered 
domain (our examples use integers). It provides two operations: enqueue 
(pqueue_enq) inserts an item into the queue, and dequeue (pqueue_deq) re- 
moves and returns the least item in the queue. A well-known technique for 
implementing a priority queue is to use a heap, a binary tree in which each 
node has a higher priority than its children. Figure 1 shows a sequential 
implementation of a priority queue that satisfies our conditions (footnote 2).

2 This code is adapted from [10]. 
#define PARENT(i) (((i) - 1) >> 1)
#define LEFT(i)   (((i) << 1) + 1)
#define RIGHT(i)  (((i) + 1) << 1)

/* Restore the heap property in the subtree rooted at index i. */
void pqueue_heapify(pqueue_type *p, int i){
    int l, r, best, swap;

    l = LEFT(i);
    r = RIGHT(i);
    /* valid indices are 0 .. size-1, so the bounds tests are strict */
    best = (l < p->size && p->elements[l] > p->elements[i]) ? l : i;
    best = (r < p->size && p->elements[r] > p->elements[best]) ? r : best;
    if (best != i) {
        swap = p->elements[i];
        p->elements[i] = p->elements[best];
        p->elements[best] = swap;
        pqueue_heapify(p, best);
    }
}

int pqueue_enq(pqueue_type *p, int x){
    int i;

    if (p->size == PQUEUE_SIZE) return PQUEUE_FULL;
    i = p->size++;
    /* sift the new item up until its parent is at least as large */
    while (i > 0 && p->elements[PARENT(i)] < x) {
        p->elements[i] = p->elements[PARENT(i)];
        i = PARENT(i);
    }
    p->elements[i] = x;
    return PQUEUE_OK;
}

int pqueue_deq(pqueue_type *p){
    int best;

    if (!p->size) return PQUEUE_EMPTY;
    best = p->elements[0];
    /* move the last item to the root, then sift it down */
    p->elements[0] = p->elements[--p->size];
    pqueue_heapify(p, 0);
    return best;
}

Figure 1: A Sequential Priority Queue Implementation
4.1 The Non-Blocking Transformation

We first discuss how to transform a sequential object into a non-blocking 
concurrent object. In this section we present a protocol that guarantees 
correctness, and in the next section we extend the protocol to enhance per- 
formance. 

Omitting certain important details, the basic technique is the following. 
The processes share a variable that holds a pointer to the object's current
version. Each process (1) reads the pointer using load-linked, (2) copies the
indicated version into another block, (3) applies the sequential operation
to the copy, and (4) calls store-conditional to swing the pointer from the
old version to the new. If the last step fails, the process restarts at Step
1. Each execution of these four steps is called an attempt. Linearizability
is straightforward, since the order in which operations appear to happen is
the order of their final calls to store-conditional. Barring spurious failures
of the store-conditional primitive, this protocol is non-blocking because at
least one out of every n attempts must succeed.

Memory management for small objects is almost trivial. Each process 
owns a single block of unused memory. In Step 2, the process copies the
object's current version into its own block. When it succeeds in swinging 
the pointer from the old version to the new, it gives up ownership of the 
new version's block, and acquires ownership of the old version's block. Since 
the process that replaces a particular version is uniquely determined, each 
block has a unique and well-defined owner at all times. If all blocks are the 
same size, then support for m small objects requires m + n + 1 blocks. 
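The four-step attempt and the exchange of block ownership can be sketched as follows. This is a sequential model under stated assumptions, not the prototype code: the shared pointer carries a write counter to imitate load-linked and store-conditional, the small object is a hypothetical one-field counter, and the interfere hook is an artificial device for running a rival process in the middle of an attempt. All names (counter_t, shared_ptr_t, process_t, Counter_inc, rival_inc, demo) are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef struct { int value; } counter_t;       /* stand-in small object */

/* Simulated shared pointer; the write count exposes interference. */
typedef struct { counter_t *ptr; unsigned long writes; } shared_ptr_t;
typedef struct { unsigned long writes_seen; } link_t;

static counter_t *load_linked(link_t *l, const shared_ptr_t *s) {
    l->writes_seen = s->writes;
    return s->ptr;
}

static bool store_conditional(const link_t *l, shared_ptr_t *s, counter_t *p) {
    if (l->writes_seen != s->writes) return false;  /* someone else stored */
    s->ptr = p;
    s->writes++;
    return true;
}

static void counter_inc(counter_t *c) { c->value++; }  /* sequential operation */

/* One process's private state: the single spare block it owns. */
typedef struct { counter_t *spare; } process_t;

/* Non-blocking increment. If interfere is non-NULL it is called once,
   just before the first store_conditional, to model a rival process.
   Returns the number of attempts that were needed. */
static int Counter_inc(process_t *self, shared_ptr_t *obj,
                       void (*interfere)(shared_ptr_t *)) {
    int attempts = 0;
    for (;;) {
        link_t link;
        attempts++;
        counter_t *old = load_linked(&link, obj);       /* step 1 */
        memcpy(self->spare, old, sizeof *old);          /* step 2 */
        counter_inc(self->spare);                       /* step 3 */
        if (interfere) { interfere(obj); interfere = NULL; }
        if (store_conditional(&link, obj, self->spare)) {  /* step 4 */
            self->spare = old;  /* give up the new block, acquire the old */
            return attempts;
        }
    }
}

static process_t rival_proc;
static void rival_inc(shared_ptr_t *obj) { Counter_inc(&rival_proc, obj, NULL); }

/* P's first attempt is overtaken by the rival, so P needs two attempts. */
static int demo(void) {
    counter_t blocks[3] = { {0}, {0}, {0} };  /* current version + one spare each */
    shared_ptr_t obj = { &blocks[0], 0 };
    process_t p = { &blocks[1] };
    rival_proc.spare = &blocks[2];
    int attempts = Counter_inc(&p, &obj, rival_inc);
    assert(attempts == 2);
    /* every block still has exactly one owner */
    assert(p.spare != rival_proc.spare);
    assert(p.spare != obj.ptr && rival_proc.spare != obj.ptr);
    return obj.ptr->value;
}
```

In the demo both increments take effect, and after the exchange each of the three blocks is owned by exactly one party: the shared pointer, P, or the rival.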

A slow process may observe the object in an inconsistent state. For ex- 
ample, processes P and Q may read a pointer to a block b, Q may swing 
the pointer to block b' and then start a new operation. If P copies b while 
Q is copying b' to b, then P's copy may not be a valid state of the sequen- 
tial object. This race condition raises an important software engineering 
issue. Although P's subsequent store-conditional is certain to fail, it may
be difficult to ensure that the sequential operation does not store into an 
out-of-range location, divide by zero, or perform some other illegal action. 
It would be imprudent to require programmers to write sequential oper- 
ations that avoid such actions when presented with arbitrary bit strings. 
Instead, we insert a consistency check after copying the old version, but 
before applying the sequential operation. Consistency can be checked ei- 
ther by hardware or by software. A simple hardware solution is to include 
a validate instruction that checks whether a variable read by a load-linked
instruction has been modified. Implementing such a primitive in an
architecture that already supports store-conditional should be straightforward,
since they have similar functionalities. In our examples, however, we use a
software solution. Each version has two associated counters, check[0] and
check[1]. If the counters are equal, the version is consistent. To modify a
version, a process increments check[0], makes the modifications, and then
increments check[1]. When copying, a process reads check[1], copies the
version, and then reads check[0]. Incrementing the counters in one order
and reading them in the other ensures that if the counters match, then the
copy is consistent (footnote 3).
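The counter discipline can be illustrated with a small sequential sketch. The block layout (four integers), the midway hook that lets a writer run halfway through the copy, and the names version_t, checked_copy, overwrite_all and overwrite_partly are assumptions made for this example; a real implementation would also have to consider memory ordering, which this model ignores.

```c
#include <assert.h>
#include <stdbool.h>

#define WORDS 4

typedef struct {
    int data[WORDS];     /* stand-in for the sequential object's block */
    unsigned check[2];   /* equal counters mean the version is consistent */
} version_t;

/* Writer side, split in two so that an update can be left half done. */
static void begin_modify(version_t *v) { v->check[0]++; }
static void end_modify(version_t *v)   { v->check[1]++; }

/* Reader side: copy src->data into dst, calling midway (if non-NULL)
   after half of the words to model a concurrent writer. Returns true
   if the counters show the copy to be consistent. */
static bool checked_copy(version_t *src, int dst[WORDS],
                         void (*midway)(version_t *)) {
    unsigned first = src->check[1];        /* read check[1] first */
    for (int i = 0; i < WORDS; i++) {
        if (i == WORDS / 2 && midway) midway(src);
        dst[i] = src->data[i];
    }
    unsigned last = src->check[0];         /* read check[0] last */
    return first == last;
}

/* A writer that completes a whole update while the reader is copying. */
static void overwrite_all(version_t *v) {
    begin_modify(v);
    for (int i = 0; i < WORDS; i++) v->data[i] = 9;
    end_modify(v);
}

/* A writer that has started an update but not yet finished it. */
static void overwrite_partly(version_t *v) {
    begin_modify(v);
    v->data[WORDS - 1] = 7;
    assert(v->check[0] != v->check[1]);    /* version is marked inconsistent */
}
```

In both interference cases the reader sees a check[0] that is ahead of the check[1] it read earlier, so the torn copy is rejected before the sequential operation is applied to it.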

This protocol does not work if compare&swap replaces store-conditional.
Consider the following execution: P and Q each read a pointer to a block
b, Q completes its operation, replacing b with b' and acquiring ownership of
b. Q then completes a second operation, replacing b' with b. If P now does
a compare&swap, then it will erroneously install an out-of-sequence version.
Elsewhere [23], we describe a more complex protocol in which P "freezes" a
block before reading it, ensuring that the block will not be recycled while the
attempt is in progress. As mentioned above, the resulting protocols are more
complex and less efficient than the ones described here for store-conditional.
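The execution above can be replayed in a single thread using the C11 atomic_compare_exchange_strong operation as a stand-in for compare&swap. This is only a sketch of the scenario: the one-field block_t object and the function aba_demo are hypothetical, and the interleaving of P and Q is written out by hand rather than produced by real concurrency.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { int value; } block_t;   /* stand-in small object: a counter */

/* Replays the execution with increments as the operations. Returns true
   if P's stale compare&swap succeeds, and reports the installed value. */
static bool aba_demo(int *installed_value) {
    block_t b = { 10 }, b_prime = { 0 }, p_block = { 0 };
    _Atomic(block_t *) shared = &b;

    /* P reads the pointer to b and prepares a new version from it. */
    block_t *p_old = atomic_load(&shared);
    p_block.value = p_old->value + 1;                       /* 11 */

    /* Q's first operation: replace b with b'; Q now owns b. */
    b_prime.value = b.value + 1;                            /* 11 */
    block_t *expected = &b;
    bool q1 = atomic_compare_exchange_strong(&shared, &expected, &b_prime);

    /* Q's second operation: recycle b and replace b' with it. */
    b.value = b_prime.value + 1;                            /* 12 */
    expected = &b_prime;
    bool q2 = atomic_compare_exchange_strong(&shared, &expected, &b);
    assert(q1 && q2);

    /* P compares pointers only; the pointer is b again, so P succeeds
       even though the object changed twice in the interim. */
    expected = p_old;
    bool ok = atomic_compare_exchange_strong(&shared, &expected, &p_block);
    *installed_value = atomic_load(&shared)->value;
    return ok;
}
```

Three increments were applied to an initial value of 10, yet the installed version holds 11: both of Q's updates are lost. A store-conditional in P's place would have failed, because the shared pointer was written after P's load-linked.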

Several optimizations are possible. If the hardware provides a validate
operation, then read-only operations can complete with a successful validate
instead of a store-conditional. An object may be significantly smaller than a
full block. If programmers follow a convention where the object's true size is
kept in a fixed location within the block, then the concurrent operation can
avoid unnecessary copying. (Our prototypes make use of this optimization.)

We are now ready to review the protocol in more detail (Figure 2). A
concurrent object is a shared variable that holds a pointer to a structure
with two fields: (1) version is a sequential object, and (2) check is a
two-element array of unsigned (large) integers. Each process keeps a pointer
(new) that points to the block it owns. The process enters a loop. It reads
the pointer using load-linked, and marks the new version as inconsistent by
setting check[0] to check[1] + 1. It then reads the old version's check[1]
field, copies the version field, and then reads the check[0] field. If the two
counters fail to match, then the copy is inconsistent, and the process restarts
the loop. Otherwise, the process applies the sequential operation to the
version field, and then increments check[1], indicating that the version is
consistent. It then attempts to reset the pointer using store-conditional. If
it succeeds, the operation returns; otherwise the loop is resumed.

3 Counters are bounded, so there is a remote chance that a consistency check will
succeed incorrectly if a counter cycles all the way around during a single attempt. As
a practical matter, this problem is avoided simply by using a large enough (e.g., 32 bit)
counter.

typedef struct {
    pqueue_type version;    /* sequential object */
    unsigned check[2];      /* consistency counters */
} Pqueue_type;

static Pqueue_type *new_pqueue;     /* block owned by this process */

int Pqueue_deq(Pqueue_type **Q){
    Pqueue_type *old_pqueue;                    /* concurrent object */
    pqueue_type *old_version, *new_version;     /* seq object */
    int result;
    unsigned first, last;

    while (1) {
        old_pqueue = load_linked(Q);
        old_version = &old_pqueue->version;
        new_version = &new_pqueue->version;
        /* mark the new version inconsistent, as described in the text */
        new_pqueue->check[0] = new_pqueue->check[1] + 1;
        first = old_pqueue->check[1];
        copy(old_version, new_version);
        last = old_pqueue->check[0];
        if (first == last) {
            result = pqueue_deq(new_version);
            /* the new version is consistent again */
            new_pqueue->check[1]++;
            /* install the whole block, not just its version field */
            if (store_conditional(Q, new_pqueue)) break;
        } /* if */
    } /* while */
    new_pqueue = old_pqueue;
    return result;
} /* Pqueue_deq */

Figure 2: Simple Non-Blocking Protocol

4.2 Experimental Results 

The non-blocking property is best thought of as a kind of fault-tolerance. 
In return for extra work (updating a copy instead of updating in place), 
the program acquires the ability to withstand certain failures (unexpected 
process failure or delay). In this section, we present experimental results 
that provide a rough measure of this additional overhead, and that allow 
us to identify and evaluate certain additional techniques that substantially 
enhance performance. We will show that a naive implementation of the 
non-blocking transformation performs poorly, even allowing for the cost of 
simulated load-linked and store-conditional, but that adding a simple
exponential backoff dramatically increases throughput.

As described above, we constructed a prototype implementation of a
small priority queue on an Encore Multimax, in C, using simulated load-linked
and store-conditional primitives. As a benchmark, we measure the elapsed
time needed for n processes to enqueue and then dequeue 2^20/n items from
a shared 16-element priority queue (Figure 3), where n ranges from 1 to 16.
As a control, we also ran the same benchmark using the same heap
implementation of the priority queue, except that updates were done in place,
using an in-line compiled test-and-test-and-set (footnote 4) spin lock to
achieve mutual exclusion. This test-and-test-and-set spin lock is a built-in
feature of Encore's C compiler, and it represents how most current systems
synchronize access to shared data structures.
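For readers unfamiliar with the control, a test-and-test-and-set lock can be sketched with C11 atomics as follows. This is not the Encore compiler's built-in lock, whose code is not shown in this paper; the type ttas_lock and its functions are hypothetical names, and atomic_exchange plays the role of test&set.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool held; } ttas_lock;

static void ttas_init(ttas_lock *l) { atomic_init(&l->held, false); }

/* One try: read the lock first, and attempt test&set only if it looks free.
   Returns true if the lock was acquired. */
static bool ttas_try_acquire(ttas_lock *l) {
    if (atomic_load_explicit(&l->held, memory_order_relaxed))
        return false;                                   /* test: lock is busy */
    return !atomic_exchange_explicit(&l->held, true,
                                     memory_order_acquire);  /* test&set */
}

/* Spin until the lock is acquired. */
static void ttas_acquire(ttas_lock *l) {
    for (;;) {
        /* spin on ordinary reads, which a cache can satisfy locally */
        while (atomic_load_explicit(&l->held, memory_order_relaxed)) { }
        if (!atomic_exchange_explicit(&l->held, true, memory_order_acquire))
            return;
    }
}

static void ttas_release(ttas_lock *l) {
    /* releasing a lock that is not held would be a caller bug */
    assert(atomic_load_explicit(&l->held, memory_order_relaxed));
    atomic_store_explicit(&l->held, false, memory_order_release);
}
```

The inner read loop is what lets waiting processes spin on cached copies of the lock instead of issuing a test&set on every iteration.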

When evaluating the performance of these benchmarks, it is important 
to understand that they were run under circumstances where timing anoma- 
lies and delays almost never occur. Each process ran on its own dedicated 
processor, and the machine was otherwise idle, ensuring that processes were 
likely to run uninterruptedly. The processes repeatedly accessed a small re- 
gion of memory, making page faults unlikely. Under these circumstances, the 
costs of avoiding waiting are visible, although the benefits are not. Neverthe- 
less, we chose these circumstances because they best highlight the inherent 
costs of our proposal. 

4 A test-and-test-and-set [41] loop repeatedly reads the lock until it observes the lock 
is free, and then tries the test&set operation. 
#define million (1024 * 1024)
shared Pqueue_type *object;
int N;                      /* number of processes */

process(){
    int work = million / N;
    int i;

    for (i = 0; i < work; i++) {
        Pqueue_enq(object, random());
        Pqueue_deq(object);
    }
}

Figure 3: Concurrent Heap Benchmark

In Figure 4, the horizontal axis represents the number of concurrent 
processes executing the benchmark, and the vertical axis represents the time 
taken (in seconds). The top curve is the time taken using the non-blocking 
protocol, and the lower curve is the time taken by the spin lock. When 
reading this graph, it is important to bear in mind that each point represents 
approximately the same amount of work: enqueuing and dequeuing 2^20
(about a million) randomly-generated numbers. In the absence of memory
contention, both curves would be nearly flat (footnote 5).

The simple non-blocking protocol performs much worse than the spin-lock
protocol, even allowing for the inherent inefficiency of the simulated
load-linked and store-conditional primitives. The poor performance of the
non-blocking protocol is primarily a result of memory contention. In each
protocol, only one of the n processes is making progress at any given time.
In the spin-lock protocol, it is the process in the critical section, while in the
non-blocking protocol, it is the process whose store-conditional will eventually
succeed. In the spin-lock protocol, however, the processes outside the
critical section are spinning on cached copies of the lock, and are therefore
not generating any bus traffic. In the non-blocking protocol, by contrast, all
processes are generating bus traffic, so only a fraction of the bus bandwidth
is dedicated to useful work.

5 Concurrent executions are slightly less efficient because the heap's maximum possible
size is a function of the level of concurrency.

Figure 5: Simple Non-Blocking Protocol: Number of Attempts

Figure 6: Non-Blocking with Backoff: Number of Attempts

The simple non-blocking protocol has a second weakness: starvation.
The enqueue operation is about 10% slower than the dequeue operation.
If we look at the average number of attempts associated with each process
(Figure 5), we can see that enqueues make slightly more unsuccessful
attempts than dequeues, but that each makes an average of fewer than six
attempts. If we look at the maximum number of attempts, however, a
dramatically different story emerges. The maximum number of unsuccessful
enqueue attempts is in the high thousands, while the maximum number of
dequeue attempts hovers around one hundred. This table shows that starvation is
indeed a problem, since a longer operation may have difficulty completing if
it competes with shorter operations.

These performance problems have a simple solution. We introduce an 
exponential backoff [2, 34, 35] between successive attempts (Figure 7). Each 
process keeps a dynamically-adjusted maximum delay. When an operation 
starts, it halves its current maximum delay. Each time an attempt fails, the 
process waits for a random duration less than the maximum delay, and then 
doubles the maximum delay, up to a fixed limit (footnote 6).
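The delay policy just described can be isolated in a few lines of C. The sketch below is an illustration under assumptions, not the prototype's code: the cap BACKOFF_LIMIT, the type backoff_t, and the function names are hypothetical, the standard rand() replaces the precomputed random table, and the wait is a simple busy loop.

```c
#include <assert.h>
#include <stdlib.h>

#define BACKOFF_LIMIT 1024          /* hypothetical fixed limit on the delay */

typedef struct { int max_delay; } backoff_t;   /* per-process state, >= 1 */

/* Called when an operation starts: halve the current maximum delay. */
static void backoff_start(backoff_t *b) {
    if (b->max_delay > 1) b->max_delay /= 2;
}

/* Called after a failed attempt: choose a random delay below the current
   maximum, then double the maximum up to the limit. Returns the delay. */
static int backoff_fail(backoff_t *b) {
    assert(b->max_delay >= 1);
    int delay = rand() % b->max_delay;         /* 0 <= delay < max_delay */
    if (b->max_delay < BACKOFF_LIMIT) b->max_delay *= 2;
    if (b->max_delay > BACKOFF_LIMIT) b->max_delay = BACKOFF_LIMIT;
    return delay;
}

/* Busy-wait for roughly delay loop iterations. */
static void backoff_wait(int delay) {
    for (volatile int i = 0; i < delay; i++) { }
}
```

A caller would invoke backoff_start once per operation and, after each failed store-conditional, pass the result of backoff_fail to backoff_wait before retrying. Because the maximum is halved only at the start of an operation, a process that has recently seen contention begins its next operation with a delay that still reflects it.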

Exponential backoff has a striking effect on performance. As illustrated 
in Figure 8, the throughput of the non-blocking protocol soon overtakes that 
of the standard spin lock implementation. Moreover, starvation is no longer 
a threat. In the typical execution shown in Figure 6, the average number
of attempts is 1.00 (out of 2^20 operations), and the maximum for enqueues
is reduced by an order of magnitude. 

As an aside, we point out that it is well-known that spin-locks also benefit 
from exponential backoff [2, 34]. We replaced the in-line compiled test- 
and-test-and-set spin lock with a hand-coded spin lock that itself employs 
exponential backoff. Not surprisingly, this protocol has the best throughput 
of all when run with dedicated processors, almost twice that of the non- 
blocking protocol. 

In summary, using exponential backoff, the non-blocking protocol signif- 
icantly outperforms a straightforward spin-lock protocol (the default pro- 
vided by the Encore C compiler), and lies within a factor of two of a sophis- 
ticated spin-lock implementation. 

4.3 A Wait-Free Protocol 

This protocol can be made wait-free by a technique we call operation combin- 
ing. When a process starts an operation, it records the call in an invocation 
structure (inv_type) whose fields include the operation name (op_name), 
argument value (arg), and a toggle bit (toggle) used to distinguish old and 
new invocations. When it completes an operation, it records the result in a 
response (res_type) structure, whose fields include the result (value) and 
toggle bit. Each concurrent object has an additional field: responses is
an n-element array of responses, whose Pth element is the result of P's last
completed operation. The processes share an n-element array announce of
invocations. When P starts an operation, it records the operation name and
argument in announce[P]. Each time a process records a new invocation, it
complements the invocation's toggle bit. 
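
The invocation and response records described above might be declared as in the following sketch. The field names follow the text, but the field types, the value of MAX_PROCS, the operation codes, and the helper functions announce_op and is_pending are assumptions introduced for illustration. The sketch also ignores memory-ordering concerns that a real multiprocessor implementation would have to address.

```c
#include <assert.h>

#define MAX_PROCS 16          /* assumed number of processes n */
enum { ENQ_CODE, DEQ_CODE };  /* assumed operation codes */

typedef struct {
    int op_name;   /* which operation is requested */
    int arg;       /* argument value */
    int toggle;    /* complemented each time a new invocation is recorded */
} inv_type;

typedef struct {
    int value;     /* result of the process's last completed operation */
    int toggle;    /* agrees with the invocation's toggle once applied */
} res_type;

/* Record a new invocation for process p and return its new toggle value. */
static int announce_op(inv_type announce[], int p, int op_name, int arg)
{
    assert(p >= 0 && p < MAX_PROCS);
    announce[p].op_name = op_name;
    announce[p].arg = arg;
    announce[p].toggle = !announce[p].toggle;
    return announce[p].toggle;
}

/* An invocation is still pending while its toggle bit disagrees with the
   toggle bit of the matching response. */
static int is_pending(const inv_type *inv, const res_type *res)
{
    return inv->toggle != res->toggle;
}
```

With both arrays initialized to zero, no invocation is pending; announcing an operation flips one toggle, and the operation stays pending until some process copies that toggle into the matching response.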

A wait-free enqueue operation appears in Figure 10. After performing 
the consistency check, the apply procedure (Figure 9) scans the responses 
and announce arrays, comparing the toggle fields of corresponding invoca- 
tions and responses. If the bits disagree, then it applies that invocation to 

6 For speed, each process in our prototype uses a precomputed table of random numbers, 
and certain arithmetic operations are performed by equivalent bit-wise logical operations. 
    static Pqueue_type *new_pqueue;
    static int max_delay;

    int Pqueue_deq(Pqueue_type **Q)
    {
        Pqueue_type *old_pqueue;
        pqueue_type *old_version, *new_version;
        int i, delay, result;
        unsigned first, last;

        if (max_delay > 1) max_delay = max_delay / 2;
        while (1) {
            old_pqueue = load_linked(Q);
            old_version = &old_pqueue->version;
            new_version = &new_pqueue->version;
            first = old_pqueue->check[1];
            copy(old_version, new_version);
            last = old_pqueue->check[0];
            if (first == last) {
                result = pqueue_deq(new_version);
                if (store_conditional(Q, new_version)) break;
            } /* if */
            /* backoff */
            if (max_delay < DELAY_LIMIT) max_delay = 2 * max_delay;
            delay = random() % max_delay;
            for (i = 0; i < delay; i++) ;
        } /* while */
        new_pqueue = old_pqueue;
        return result;
    }

Figure 7: Non-Blocking Protocol with Exponential Backoff



Figure 8: The Effect of Exponential Backoff
[Plot not reproduced. Horizontal axis: Number of Processes (0 to 16); one curve is labeled Spin-Lock.]



the new version, records the result in the matching position in the responses 
array, and complements the response's toggle bit. After calling the apply 
procedure to apply the pending operations to the new version, the process 
calls store-conditional to replace the old version, just as before. To deter- 
mine when its own operation is complete, P compares the toggle bits of 
its invocation with the object's matching response. It performs this com- 
parison twice; if both comparisons match, the operation is complete. This 
comparison must be done twice to avoid the following race condition: (1) P 
reads a pointer to version v. (2) Q replaces v with v'. (3) Q starts another 
operation, scans announce, applies P's operation to the new value of v, and
stores the tentative result in v's responses array. (4) P observes that the 
toggle bits match and returns. (5) Q fails to install v as the next version, 
ensuring that P has returned the wrong result. 

This protocol guarantees that as long as store_conditional has no spurious
failures, each operation will complete after at most two loop iterations
(footnote 7). If P's first or second store_conditional succeeds, the operation
is complete. Suppose the first store_conditional fails because process Q
executed an earlier store_conditional, and the second store_conditional fails
because process Q' executed an earlier store_conditional. Q' must have scanned
the announce array after Q performed its store_conditional, but Q performed
its store_conditional after P updated its invocation structure, and therefore
Q' must have carried out P's operation and set the toggle bits to agree. The
process applies the termination test repeatedly during any backoff.

We are now ready to explain why sequential operations must be total. 
Notice that in the benchmark program (Figure 3), each process enqueues an 
item before dequeuing. One might assume, therefore, that no dequeue op- 
eration will ever observe an empty queue. This assumption is wrong. Each 
process reads the object version and the announce array as two distinct 
steps, and the two data structures may be mutually inconsistent. A slow 
process executing an enqueue might observe an empty queue, and then ob- 
serve an announce array in which dequeue operations outnumber enqueue 
operations. This process's subsequent store_conditional will fail, but not 
until the sequential dequeue operation has been applied to an empty queue. 
This issue does not arise in the non-blocking protocol. 
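
To make the totality requirement concrete, the following sketch shows a small sequential priority queue whose operations are defined in every state, including an empty or full queue. It is an illustration only: the type small_pqueue, the sorted-array representation, the capacity, and the sentinel values QUEUE_EMPTY and QUEUE_FULL are assumptions, and items are assumed to be non-negative so that they cannot collide with the sentinels.

```c
#include <assert.h>

#define QUEUE_CAPACITY 8      /* assumed fixed capacity */
#define QUEUE_OK        0
#define QUEUE_EMPTY   (-1)    /* assumed sentinel: dequeue on an empty queue */
#define QUEUE_FULL    (-2)    /* assumed sentinel: enqueue on a full queue */

typedef struct {
    int size;
    int items[QUEUE_CAPACITY];   /* kept sorted, smallest item first */
} small_pqueue;

/* Total enqueue: reports QUEUE_FULL instead of failing when there is no room. */
static int small_enq(small_pqueue *q, int x)
{
    int i;
    assert(x >= 0);              /* items must not collide with the sentinels */
    if (q->size == QUEUE_CAPACITY)
        return QUEUE_FULL;
    for (i = q->size; i > 0 && q->items[i - 1] > x; i--)
        q->items[i] = q->items[i - 1];   /* shift larger items up */
    q->items[i] = x;
    q->size++;
    return QUEUE_OK;
}

/* Total dequeue: reports QUEUE_EMPTY instead of failing on an empty queue. */
static int small_deq(small_pqueue *q)
{
    int i, result;
    if (q->size == 0)
        return QUEUE_EMPTY;
    result = q->items[0];
    for (i = 1; i < q->size; i++)
        q->items[i - 1] = q->items[i];   /* shift remaining items down */
    q->size--;
    return result;
}
```

Because small_deq returns a well-defined value on an empty queue, a doomed attempt that applies a dequeue to an inconsistent snapshot does no harm before its store_conditional fails.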

Figure 11 shows the time needed to complete the benchmark program 
for the wait-free protocol. The throughput increases along with concurrency 

7 Because spurious failures are possible, this loop requires an explicit termination test; 
it cannot simply count to two. 
    void apply(inv_type announce[MAX_PROCS], Pqueue_type *object) {
        int i;
        for (i = 0; i < MAX_PROCS; i++) {
            if (announce[i].toggle != object->responses[i].toggle) {
                switch (announce[i].op_name) {
                case ENQ_CODE:
                    object->responses[i].value =
                        pqueue_enq(&object->version, announce[i].arg);
                    break;
                case DEQ_CODE:
                    object->responses[i].value = pqueue_deq(&object->version);
                    break;
                default:
                    fprintf(stderr, "Unknown operation code\n");
                    exit(1);
                } /* switch */
                object->responses[i].toggle = announce[i].toggle;
            } /* if */
        } /* for i */
    }

Figure 9: The Apply Operation

because the amount of copying per operation is reduced. Nevertheless, there 
is a substantial overhead imposed by scanning the announce array, and, more 
importantly, copying the version's responses array with each operation. As 
a practical matter, the probabilistic guarantee against starvation provided 
by exponential backoff may be preferable to the deterministic guarantee 
provided by operation combining. 

5 Large Objects 

In this section, we show how to extend the previous section's protocols to 
objects that are too large to be copied all at once. For large objects, copy- 
ing is likely to be the major performance bottleneck. Our basic premise is 
that copying should therefore be under the explicit control of the program- 
mer, since the programmer is in a position to exploit the semantics of the 
application. 
    static Pqueue_type *new_pqueue;
    static int max_delay;
    static inv_type announce[MAX_PROCS];
    static int P;   /* current process id */

    int Pqueue_deq(Pqueue_type **Q) {
        Pqueue_type *old_pqueue;
        pqueue_type *old_version, *new_version;
        int i, delay, result, new_toggle;
        unsigned first, last;

        announce[P].op_name = DEQ_CODE;
        new_toggle = announce[P].toggle = !announce[P].toggle;
        if (max_delay > 1) max_delay = max_delay >> 1;
        while ((*Q)->responses[P].toggle != new_toggle
               || (*Q)->responses[P].toggle != new_toggle) {
            old_pqueue = load_linked(Q);
            old_version = &old_pqueue->version;
            new_version = &new_pqueue->version;
            first = old_pqueue->check[1];
            /* memcpy takes the destination first: copy the old version into the new */
            memcpy(new_version, old_version, sizeof(pqueue_type));
            last = old_pqueue->check[0];
            if (first == last) {
                result = pqueue_deq(new_version);
                if (store_conditional(Q, new_version)) break;
            } /* if */
            /* backoff */
            if (max_delay < DELAY_LIMIT) max_delay = max_delay << 1;
            delay = random() % max_delay;
            for (i = 0; i < delay; i++) ;
        } /* while */
        new_pqueue = old_pqueue;
        return result;
    }

Figure 10: A Wait-Free Operation
Figure 11: Non-Blocking vs. Wait-Free






A large object is represented by a set of blocks linked by pointers. Se- 
quential operations of large objects are written in a functional style: an 
operation that changes the object's state does not modify the object in 
place. Instead, it constructs and returns a logically distinct version of the 
object. By logically distinct, we mean that the old and new versions may in 
fact share a substantial amount of memory. It is the programmer's responsi- 
bility to choose a sequential implementation that performs as little copying 
as possible. 

The basic technique is the following. Each process (1) reads the pointer 
using load_linked, (2) applies the sequential operation, which returns a pointer
to a new version, and (3) calls store-conditional to swing the pointer from 
the old version to the new. 
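
The three steps above can be sketched on a very simple functional object, a linked list that grows at the head. This sketch makes several assumptions that are not part of the protocol described here: it substitutes the C11 atomic_compare_exchange_strong operation for load_linked and store_conditional, it uses malloc rather than per-process block pools, and the names node, list_push, and concurrent_push are hypothetical. Because installed nodes are never freed or reused in this sketch, the compare-and-swap substitution does not run into the block-reuse problems that the memory management scheme below is designed to handle.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef struct node {
    int value;
    const struct node *next;   /* shared with older versions, never modified */
} node;

/* Sequential operation in functional style: the old version is left
   untouched, and the new version shares all of its blocks. */
static const node *list_push(const node *old, int value)
{
    node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->value = value;
    n->next = old;
    return n;
}

/* Non-blocking wrapper: read the root, build a new version, try to install it. */
static void concurrent_push(_Atomic(const node *) *root, int value)
{
    for (;;) {
        const node *old = atomic_load(root);               /* (1) read the pointer */
        const node *new_version = list_push(old, value);   /* (2) apply the operation */
        if (atomic_compare_exchange_strong(root, &old, new_version))
            return;                                        /* (3) pointer swung */
        free((void *)new_version);   /* attempt failed: discard the private block */
    }
}
```

A failed attempt discards only the block it allocated itself, which mirrors the idea that a process gives up its tentative allocations when its store_conditional fails.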

Memory management is slightly more complex. Since an operation may 
require allocating multiple blocks of memory, each process owns its own pool 
of blocks. When a process creates a new version of the object, it explicitly 
allocates new blocks by calling alloc, and it explicitly frees old blocks by 
calling free. The copy primitive copies the contents of one block to another. 
If the attempt succeeds, the process acquires ownership of the blocks it freed 
and relinquishes ownership of the blocks it allocated. 

A process keeps track of its blocks with a data structure called a recov- 
erable set (set_type). The abstract state of a recoverable set is given by 
three sets of blocks: committed, allocated, and freed. The set_free oper-
ation inserts a block in freed, and set_alloc moves a block from commit-
ted to allocated and returns its address. As shown in Figure 12, alloc calls
set_alloc and marks the resulting block as inconsistent, while free simply 
calls set_free. 

The recoverable set type provides three additional operations, not ex-
plicitly called by the programmer. Before executing the store_conditional,
the process calls set_prepare to mark the blocks in allocated as consistent.
If the store_conditional succeeds, it calls set_commit to set committed to the
union of freed and committed, and if it fails, it calls set_abort to set both
freed and allocated to the empty set.
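
The commit and abort behavior can be illustrated with a small ring-buffer sketch. It is loosely modeled on the layout of Figure 12 but is not the prototype's code: the type rset, the function names, the use of void pointers for blocks, and the choice to return NULL when no committed block is available are assumptions, and the consistency marking done by set_prepare is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define SET_SIZE 8   /* assumed ring capacity */

typedef struct {
    int alloc_ptr, free_ptr;           /* next block to hand out, next empty slot */
    int old_alloc_ptr, old_free_ptr;   /* values restored on abort */
    int alloc_count, free_count;       /* tentative allocations and frees */
    int size;                          /* number of committed blocks */
    void *blocks[SET_SIZE];
} rset;

static void rset_init(rset *s, void *blocks[], int n)
{
    int i;
    assert(n >= 0 && n <= SET_SIZE);
    for (i = 0; i < n; i++)
        s->blocks[i] = blocks[i];
    s->alloc_ptr = s->old_alloc_ptr = 0;
    s->free_ptr = s->old_free_ptr = n % SET_SIZE;
    s->alloc_count = s->free_count = 0;
    s->size = n;
}

/* Tentatively move one block from committed to allocated. */
static void *rset_alloc(rset *s)
{
    void *b;
    if (s->alloc_count == s->size)
        return NULL;                   /* no committed blocks left */
    b = s->blocks[s->alloc_ptr];
    s->alloc_ptr = (s->alloc_ptr + 1) % SET_SIZE;
    s->alloc_count++;
    return b;
}

/* Tentatively insert a block in freed. */
static void rset_free(rset *s, void *b)
{
    assert(s->size + s->free_count < SET_SIZE);   /* ring must have room */
    s->blocks[s->free_ptr] = b;
    s->free_ptr = (s->free_ptr + 1) % SET_SIZE;
    s->free_count++;
}

/* The attempt succeeded: freed blocks join committed, allocated blocks leave. */
static void rset_commit(rset *s)
{
    s->old_alloc_ptr = s->alloc_ptr;
    s->old_free_ptr = s->free_ptr;
    s->size += s->free_count - s->alloc_count;
    s->alloc_count = s->free_count = 0;
}

/* The attempt failed: forget all tentative allocations and frees. */
static void rset_abort(rset *s)
{
    s->alloc_ptr = s->old_alloc_ptr;
    s->free_ptr = s->old_free_ptr;
    s->alloc_count = s->free_count = 0;
}
```

In this sketch an aborted attempt costs nothing beyond resetting two indices and two counters, since tentatively allocated blocks never physically leave the ring until a commit.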

It might also be necessary for processes to share a pool of blocks. If a
process exhausts its local pool, it can allocate multiple blocks from the
shared pool, and if it acquires too many blocks, it can return the surplus 
to the shared pool. The shared pool should be accessed as infrequently as 
possible, since otherwise it risks becoming a contention "hot-spot." Some 
techniques for implementing shared pools appear elsewhere [23]; we did not 
use a shared pool in the prototypes shown here. 






As in the small object protocol, a process checks for consistency whenever 
it copies a block. If the copy is inconsistent, the process transfers control 
back to the main loop (e.g., using the Unix longjmp). 

5.1 Experimental Results 

For the examples presented in this section, it is convenient to follow some 
syntactic conventions. Because C procedures can return only one result 
value, we follow the convention that all sequential operations return a pointer 
to a result_type structure containing a value field (e.g., the result of a 
dequeue) and a version field (the new state of the object). Instead of 
treating the sequential and concurrent objects as distinct data structures, it 
is convenient to treat the check array as an additional field of the sequential 
object, one that is invisible to the sequential operation. 

A skew heap [44] is an approximately-balanced binary tree in which each 
node stores an item, and each node's item is less than or equal to any item in 
the subtree rooted at that node. A skew heap implements a priority queue, 
and the amortized cost of enqueuing and dequeuing items in a skew heap 
is logarithmic in the size of the tree. For our purposes, the advantage of a 
skew heap over the conventional heap is that update operations leave most 
of the tree nodes untouched. 

The skew_meld operation (Figure 13) merges two heaps. It chooses the 
heap with the lesser root, swaps its right and left children (for balance), 
and then melds the right child with the other heap. To insert item x in h, 
skew_enq melds h with the heap containing x alone. To remove an item 
from h, skew_deq (Figure 14) removes the item at the root and melds the 
root's left and right subtrees. 
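
For comparison with the copying versions in Figures 13 and 14, the following sketch gives a conventional sequential skew heap that updates nodes in place. It follows the description above (keep the lesser root, swap its children, meld the former right child with the other heap), but it performs no copying, allocation, or consistency checking, and the names skew_node, meld_in_place, enq_in_place, and deq_in_place are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct skew_node {
    int value;
    struct skew_node *left, *right;
} skew_node;

/* Merge two heaps in place and return the root of the result. */
static skew_node *meld_in_place(skew_node *a, skew_node *b)
{
    skew_node *tmp;
    if (!a) return b;             /* if one heap is empty, return the other */
    if (!b) return a;
    if (b->value < a->value) {    /* make a the heap with the lesser root */
        tmp = a; a = b; b = tmp;
    }
    /* Swap a's children; the former right child is melded with b. */
    tmp = a->left;
    a->left = meld_in_place(a->right, b);
    a->right = tmp;
    return a;
}

/* Insert the caller-supplied node x by melding with a one-node heap. */
static skew_node *enq_in_place(skew_node *h, skew_node *x)
{
    x->left = x->right = NULL;
    return meld_in_place(h, x);
}

/* Remove the least item: store it in *value and meld the root's subtrees. */
static skew_node *deq_in_place(skew_node *h, int *value)
{
    assert(h != NULL);
    *value = h->value;
    return meld_in_place(h->left, h->right);
}
```

Each meld touches only the nodes along one path from the root, which is the property that lets the copying version leave most of the tree shared between the old and new versions.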

We modified the priority queue benchmark of Figure 3 to initialize the 
priority queue to hold 512 randomly generated integers. 

Figure 15 shows the relative throughput of a non-blocking skew heap, a 
spin-lock heap with updates in place, and a spin-lock skew heap with updates 
in place. The non-blocking skew heap and the spin-lock heap are about the 
same, and the spin-lock skew heap has almost twice the throughput of the 
non-blocking skew heap, in agreement with our experimental results for the 
small object protocol. 
    typedef struct
    {
        int free_ptr, alloc_ptr;           /* next full & empty slots */
        int free_count, alloc_count;       /* number of allocs & frees */
        int size;                          /* number of committed entries */
        int old_free_ptr, old_alloc_ptr;   /* reset on abort */
        Skew_type *blocks[SET_SIZE];       /* pointers to blocks */
    } set_type;

    Object_type *set_alloc(set_type *q) {
        Object_type *x;
        if (q->alloc_count == q->size) {
            fprintf(stderr, "alloc: wraparound!\n");
            exit(1);
        }
        x = q->blocks[q->alloc_ptr];
        q->alloc_ptr = (q->alloc_ptr + 1) % SET_SIZE;
        q->alloc_count++;
        return x;
    }

    void set_commit(set_type *q) {
        q->old_alloc_ptr = q->alloc_ptr;
        q->old_free_ptr = q->free_ptr;
        q->size = q->size + q->free_count - q->alloc_count;
        q->free_count = q->alloc_count = 0;
    }

    void set_prepare(set_type *q) {
        int i;
        for (i = 0; i < q->alloc_count; i++)
            q->blocks[q->old_alloc_ptr + i]->check[1]++;
    }

    Object_type *alloc() {
        Object_type *s;
        s = set_alloc(pool);
        s->check[0] = s->check[1] + 1;
        return s;
    }

Figure 12: Part of a Recoverable Set Implementation
    typedef struct skew_rep {
        int value;
        int toggle;                  /* left or right next? */
        struct skew_rep *child[2];   /* left and right children */
        int check[2];                /* inserted by system */
    } Skew_type;

    /*
       Skew_meld assumes its first argument is already copied.
    */
    Skew_type *skew_meld(Skew_type *q, Skew_type *qq) {
        int toggle;
        Skew_type *p;

        if (!q) return (qq);     /* if one is empty, return the other */
        if (!qq) return (q);
        p = queue_alloc(pool);   /* make a copy of qq */
        copy(qq, p);
        queue_free(pool, qq);
        if (q->value > p->value) {
            toggle = q->toggle;
            q->child[toggle] = skew_meld(p, q->child[toggle]);
            q->toggle = !toggle;
            return q;
        } else {
            toggle = p->toggle;
            p->child[toggle] = skew_meld(q, p->child[toggle]);
            p->toggle = !toggle;
            return p;
        }
    }

Figure 13: Skew Heap: The Meld Operation






    result_type *skew_deq(Skew_type *q) {
        Skew_type *left, *new_left, *right, buffer;
        static result_type r;

        r.value = SKEW_EMPTY;
        r.version = 0;
        if (q) {
            copy(q, &buffer);
            queue_free(pool, q);
            r.value = buffer.value;
            left = buffer.child[0];
            right = buffer.child[1];
            if (!left) {
                r.version = right;
            } else {
                new_left = alloc(pool);
                copy(left, new_left);
                queue_free(pool, left);
                r.version = skew_meld(new_left, right);
            }
        }
        return &r;
    }

Figure 14: Skew Heap: The Dequeue Operation
Figure 15: Large Heap Throughput

6 Conclusions

Conventional concurrency control techniques based on mutual exclusion 
were originally developed for single-processor machines in which the proces- 
sor was multiplexed among a number of processes. To maximize throughput 
in a uniprocessor architecture, it suffices to keep the processor busy. In a 
multiprocessor architecture, however, maximizing throughput is more com- 
plex. Individual processors are often subject to unpredictable delays, and 
throughput will suffer if a process capable of making progress is unnecessar- 
ily forced to wait for one that is not. 

To address this problem, a number of researchers have investigated wait- 
free and non-blocking algorithms and data structures that do not rely on 
waiting for synchronization. Much of this work has been theoretical. There 
are two obstacles to making such an approach practical: conceptual com- 
plexity, and performance. Conceptual complexity refers to the well-known 
difficulty of reasoning about the behavior of concurrent programs. Any prac-
tical methodology for constructing highly concurrent data structures must
include some mechanism for ensuring their correctness. Performance refers 
to the observation that avoiding waiting, like most other kinds of fault- 
tolerance, incurs a cost when it is not needed. For a methodology to be 
practical, this overhead must be kept to a minimum. 

In the methodology proposed here, we address the issue of conceptual 
complexity by proposing that programmers design their data structures in 
a stylized sequential manner. Because these programs are sequential, both 
formal and informal reasoning are greatly simplified. 

We address the issue of performance in several ways: 

• We observe that the load_linked and store_conditional synchronization
primitives permit significantly simpler and more efficient algorithms
than compare&swap.

• We propose extremely simple and efficient memory management tech- 
niques. 

• We provide experimental evidence that a naive implementation of a 
non-blocking protocol incurs unacceptable memory contention, but 
that this contention can be eliminated by applying known techniques 
such as exponential backoff. Our prototype implementations (using 
inefficient simulated synchronization primitives) outperform conven-
tional ("test-and-test-and-set") spin-lock implementations, and lie within
a factor of two of more sophisticated (exponential backoff) spin-lock
implementations.

• For large objects, programmers are free to exercise their ingenuity to 
keep the cost of copying under control. Whenever possible, correct- 
ness should be the responsibility of the system, and performance the 
responsibility of the programmer. 

A promising area for future research concerns how one might exploit 
type-specific properties to increase concurrency. Any such approach would 
have to sacrifice some of the simplicity of our methodology, since the pro- 
grammer would have to reason explicitly about concurrency. Nevertheless, 
perhaps one could use our methodology to construct simple concurrent ob- 
jects that could be combined to implement more complex concurrent objects, 
in the same way that B-link [33] trees combine a sequence of low-level atomic 
operations to implement a single atomic operation at the abstract level. 

As illustrated by Andrews and Schneider's comprehensive survey [3], 
most language constructs for shared memory architectures focus on tech-
niques for managing mutual exclusion. Because the transformations de-
scribed here are simple enough to be performed by a compiler or prepro-
cessor, it is intriguing to speculate about how a programming language might
support the methodology proposed here. For example, inheritance might be 
a convenient way to combine the object fields (e.g., check variables) used by 
the run-time system with those introduced by the programmer. Program- 
ming language design raises many complex issues that lie well beyond the 
scope of this paper, but the issue merits further attention. 